L3: Dplyr and Basic Visualization

Bogdan G. Popescu

John Cabot University

Background

Dplyr and Basic Visualization

Today, we’ll learn how to:

  • Load and inspect real-world datasets
  • Clean and prepare data for analysis
  • Use dplyr to group and summarize values
  • Join datasets using a common key
  • Create and customize scatterplots using ggplot2

Opening a File

Download the following datasets from Dropbox:

Opening a File

Now put them in your working directory

Place the files in a folder named “data” inside your “week3” directory

Creating a New Quarto Document

We will now create a new quarto document corresponding to the week 3 lab

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Press CMD + A or Ctrl + A and then Press Delete

Using R

Using R

Then type:

---
title: "Notebook"
author: "Your Name"
date: "July 26, 2025"
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Using R

Opening the File

We first remove what we had previously

# Remove all objects from memory
rm(list = ls())

Opening the File

To get the file path we simply go to the relevant folder

# This opens a file dialog to select your file
file_path <- file.choose()
file_path

Opening the File

Once we have the path, we can now read the files:

# Defining Paths
path_data <- "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/essential_stats/week3/lab/data/"
# Use file.path() to construct full path
life_expectancy_df <- read.csv(file.path(path_data, "life-expectancy.csv"))
urbanization_df <- read.csv(file.path(path_data, "share-of-population-urban.csv"))

Please change the path to your computer’s path.

What We Want to Do

Explanation

These are our datasets

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

This is what this looks like in code:

library(dplyr)
life_expectancy_df2<-life_expectancy_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation

What We Want to Do

Explanation: Recap

Doing it

1. Calculating Average by Country Code

This is what this looks like in code for life_expectancy_df2:

library(dplyr)
life_expectancy_df2<-life_expectancy_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))

This is what this looks like in code for urbanization_df2:

urbanization_df2<-urbanization_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(urb_mean=mean(Urban.population....of.total.population.))

Doing it

2. Inspect the first entries

This is how we inspect the first entries for life_expectancy_df2:

head(life_expectancy_df2, n=3)

This corresponds to the first three entries of the dataframe:

Doing it

2. Inspect the first entries

This is how we inspect the first entries for urbanization_df2:

head(urbanization_df2, n=3)

This corresponds to the first three entries of the dataframe:

Doing it

2. Inspect the first entries

We notice the country codes.

unique(life_expectancy_df2$Code)
  [1] ""         "ABW"      "AFG"      "AGO"      "AIA"      "ALB"     
  [7] "AND"      "ARE"      "ARG"      "ARM"      "ASM"      "ATG"     
 [13] "AUS"      "AUT"      "AZE"      "BDI"      "BEL"      "BEN"     
 [19] "BES"      "BFA"      "BGD"      "BGR"      "BHR"      "BHS"     
 [25] "BIH"      "BLR"      "BLZ"      "BMU"      "BOL"      "BRA"     
 [31] "BRB"      "BRN"      "BTN"      "BWA"      "CAF"      "CAN"     
 [37] "CHE"      "CHL"      "CHN"      "CIV"      "CMR"      "COD"     
 [43] "COG"      "COK"      "COL"      "COM"      "CPV"      "CRI"     
 [49] "CUB"      "CUW"      "CYM"      "CYP"      "CZE"      "DEU"     
 [55] "DJI"      "DMA"      "DNK"      "DOM"      "DZA"      "ECU"     
 [61] "EGY"      "ERI"      "ESH"      "ESP"      "EST"      "ETH"     
 [67] "FIN"      "FJI"      "FLK"      "FRA"      "FRO"      "FSM"     
 [73] "GAB"      "GBR"      "GEO"      "GGY"      "GHA"      "GIB"     
 [79] "GIN"      "GLP"      "GMB"      "GNB"      "GNQ"      "GRC"     
 [85] "GRD"      "GRL"      "GTM"      "GUF"      "GUM"      "GUY"     
 [91] "HKG"      "HND"      "HRV"      "HTI"      "HUN"      "IDN"     
 [97] "IMN"      "IND"      "IRL"      "IRN"      "IRQ"      "ISL"     
[103] "ISR"      "ITA"      "JAM"      "JEY"      "JOR"      "JPN"     
[109] "KAZ"      "KEN"      "KGZ"      "KHM"      "KIR"      "KNA"     
[115] "KOR"      "KWT"      "LAO"      "LBN"      "LBR"      "LBY"     
[121] "LCA"      "LIE"      "LKA"      "LSO"      "LTU"      "LUX"     
[127] "LVA"      "MAC"      "MAF"      "MAR"      "MCO"      "MDA"     
[133] "MDG"      "MDV"      "MEX"      "MHL"      "MKD"      "MLI"     
[139] "MLT"      "MMR"      "MNE"      "MNG"      "MNP"      "MOZ"     
[145] "MRT"      "MSR"      "MTQ"      "MUS"      "MWI"      "MYS"     
[151] "MYT"      "NAM"      "NCL"      "NER"      "NGA"      "NIC"     
[157] "NIU"      "NLD"      "NOR"      "NPL"      "NRU"      "NZL"     
[163] "OMN"      "OWID_KOS" "OWID_WRL" "PAK"      "PAN"      "PER"     
[169] "PHL"      "PLW"      "PNG"      "POL"      "PRI"      "PRK"     
[175] "PRT"      "PRY"      "PSE"      "PYF"      "QAT"      "REU"     
[181] "ROU"      "RUS"      "RWA"      "SAU"      "SDN"      "SEN"     
[187] "SGP"      "SHN"      "SLB"      "SLE"      "SLV"      "SMR"     
[193] "SOM"      "SPM"      "SRB"      "SSD"      "STP"      "SUR"     
[199] "SVK"      "SVN"      "SWE"      "SWZ"      "SXM"      "SYC"     
[205] "SYR"      "TCA"      "TCD"      "TGO"      "THA"      "TJK"     
[211] "TKL"      "TKM"      "TLS"      "TON"      "TTO"      "TUN"     
[217] "TUR"      "TUV"      "TWN"      "TZA"      "UGA"      "UKR"     
[223] "URY"      "USA"      "UZB"      "VAT"      "VCT"      "VEN"     
[229] "VGB"      "VIR"      "VNM"      "VUT"      "WLF"      "WSM"     
[235] "YEM"      "ZAF"      "ZMB"      "ZWE"     

Doing it

2. Inspect the first entries

We notice the country codes.

Doing it

2. Inspect the first entries

We notice the country codes.

Doing it

2. Inspect the first entries

So we have some unusual entries such as:

  • “”
  • “OWID_KOS”
  • “OWID_WRL”.

Doing it

2. Inspect the first entries

Which countries do these codes refer to? We can subset the data to find out.

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
weird_countries<-subset(life_expectancy_df2, (Code %in% weird_labels))

Doing it

2. Inspect the first entries

weird_countries2<-subset(weird_countries, !duplicated(weird_countries$Code))
head(weird_countries2, n=10)

So, these are not the typical country codes made out of three letters

Doing it

3. Cleaning our dataset

We can now clean the dataset

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df2, !(Code %in% weird_labels))
head(clean_life_expectancy_df, n=5)

Doing it

3. Cleaning our dataset

Let’s repeat the procedure for the other dataset

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_urbanization_df<-subset(urbanization_df2, !(Code %in% weird_labels))
head(clean_urbanization_df, n=5)

Doing it

4. Performing a left join

We will now perform a left join to combine urbanization data with life expectancy data based on Code.

Doing it

4. Performing a left join

We will now perform a left join to combine urbanization data with life expectancy data based on Code.

merged_data<-left_join(clean_life_expectancy_df, clean_urbanization_df, by = c("Code"="Code"))
head(merged_data, n=10)

Doing it

5. Getting rid of NAs for urbanization

This is how we remove NA values

merged_data2<-subset(na.omit(merged_data))
head(merged_data2, n=10)

Doing it

5. Getting rid of NAs for urbanization

How many countries did we drop?

nrow(merged_data) - nrow(merged_data2)
[1] 21

We dropped 21.

Doing it

5. Getting rid of NAs for urbanization

How many countries did we drop?

nrow(merged_data) - nrow(merged_data2)
[1] 21

We dropped 21.

Doing it

5. Getting rid of NAs for urbanization

What are those countries?

countries_dropped<-merged_data[!complete.cases(merged_data), ]
countries_dropped$Code
 [1] "AIA" "BES" "COK" "ESH" "FLK" "GGY" "GLP" "GUF" "JEY" "MAF" "MSR" "MTQ"
[13] "MYT" "NIU" "REU" "SHN" "SPM" "TKL" "TWN" "VAT" "WLF"

Doing it

5. Getting rid of NAs for urbanization

Let us have a quick look at what these codes mean.

cntries<-subset(life_expectancy_df, Code %in% countries_dropped$Code)
unique(cntries$Entity)
 [1] "Anguilla"                        "Bonaire Sint Eustatius and Saba"
 [3] "Cook Islands"                    "Falkland Islands"               
 [5] "French Guiana"                   "Guadeloupe"                     
 [7] "Guernsey"                        "Jersey"                         
 [9] "Martinique"                      "Mayotte"                        
[11] "Montserrat"                      "Niue"                           
[13] "Reunion"                         "Saint Helena"                   
[15] "Saint Martin (French part)"      "Saint Pierre and Miquelon"      
[17] "Taiwan"                          "Tokelau"                        
[19] "Vatican"                         "Wallis and Futuna"              
[21] "Western Sahara"                 

Doing it

6.Creating a Scatterplot

ggplot2 is one of the most elegant and most versatile graphing libraries in R

Before we use ggplot we need to install and load the library:

install.packages("ggplot2")

Note: this has to be ggplot2, not ggplot.

Doing it

6.Creating a Scatterplot

Now, you have to load the package

library(ggplot2)

Note: this has to be ggplot2, not ggplot.

Doing it

6.Creating a Scatterplot

Doing it

7. Saving our first plot

Within our “week3/lab” folder create another folder called “graphs”.

Doing it

7. Saving our first plot

Within our “week3/lab” folder create another folder called “graphs”.

Next copy and paste the life-expectancy.csv from “data” to “graphs”

Doing it

7. Saving our first plot

Next copy and paste the life-expectancy.csv from “data” to “graphs”

Doing it

7. Saving our first plot

Figure out the path of the “graphs” folder by using file.choose()

file.choose()

Doing it

7. Saving our first plot

Figure out the path of the “graphs” folder by using `file

Doing it

7. Saving our first plot

Figure out the path of the “graphs” folder by using `file

Doing it

7. Saving our first plot

Figure out the path of the “graphs” folder by using `file

Doing it

7. Saving our first plot

Figure out the path of the “graphs” folder by using `file

Doing it

7. Saving our first plot

Figure out the path of the “graphs” folder by using `file

Doing it

7. Saving our first plot

Figure out the path of the “graphs” folder by using `file

Doing it

7. Saving our first plot

Setting the working directory

#Setting path
setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/essential_stats/week3/lab")

Relative Path Option

ggsave(fig1, file = "./graphs/fig1.jpg", 
       height = 20, width = 20, 
       units = "cm", dpi = 300)

Absolute Path Option

Change the first part to your own path

ggsave(fig1, file = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/essential_stats/week3/lab/graphs/fig1.jpg", 
       height = 20, width = 20, 
       units = "cm", dpi = 300)

Doing it

7. Saving our first plot

Once you figured out the path, delete file.choose()

ggplot options

1. Improving X and Y axis labels

fig2<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")
fig2

ggplot options

2. Adding a Title

fig2<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")
fig2

ggplot options

3. Getting rid of the grey background

fig2<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()
fig2

ggplot options

4. Fitting a line

fig2<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)

fig2

ggplot options

5. Fixing the X and Y axis to particular values

fig3<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(50,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(50,100))
fig3

ggplot options

6. Two Graphs Side by Side

library(gridExtra)
grid.arrange(fig2, fig3, ncol=2)

ggplot options

7. Labelling the Data

There are many ways of labeling the data in ggplot. But let us try a simple way.

First let us do a quick left join so that we keep the original country names

label_cntry<-subset(life_expectancy_df, select = c("Entity", "Code"))
label_cntry<-subset(label_cntry, !duplicated(Code))
merged_data3<-left_join(merged_data2, label_cntry, by = c("Code"="Code"))

ggplot options

7. Labelling the Data

fig3<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  geom_text(aes(label = Entity),
            size = 4, 
            check_overlap = TRUE, 
            position = position_nudge(y = 1))
fig3

ggplot options

7. Labelling the Data

ggplot options

7. Labelling the Data

What if we wanted to do selective labeling? Let’s say, we just want to label Italy and the US. How could we do that?

Subsetting the data to two countries

countries_of_interest = c("Italy", "United States")
italy_us_df<-subset(merged_data3, Entity %in%  countries_of_interest)

ggplot options

7. Labelling the Data

fig4<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  geom_text(aes(label = Entity),
            size = 4, 
            check_overlap = TRUE, 
            position = position_nudge(y = 1),
            data = italy_us_df)
fig4

ggplot options

7. Labelling the Data

This does not work so well

ggplot options

7. Labelling the Data

The better way is called geom_label.

fig4<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  geom_label(aes(label = Entity),
            size = 4, 
            position = position_nudge(y = 1),
            data = italy_us_df)
fig4

ggplot options

7. Labelling the Data

The better way is called geom_label.

ggplot options

7. Labelling the Data

But which one is Italy and which one is the US?

library(ggrepel)
fig4<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
 geom_label_repel(box.padding = 0.5,
    aes(label = Entity),
            size = 4, 
            position = position_nudge(y = 1),
            data = italy_us_df)
fig4

ggplot options

7. Labelling the Data

But which one is Italy and which one is the US?

Conclusion

  • We used dplyr to group, summarize, and clean data
  • We merged datasets with left_join()
  • We built scatterplots using ggplot2
  • We customized labels, titles, themes, and axes